{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction to the scikit-learn API for Machine Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is to introduce the reader to the scikit API for Machine Learning. We will look at asimple example, to understand the main pipeline followed in the manuscript.The interest reader might deepen this concepts by reading the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction import DictVectorizer\n",
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.metrics import accuracy_score\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a toy example. This is represented by two variables, describing the featuers and target training vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = [{'airport':'CDG', 'month':'October'},\n",
    "     {'airport':'MPX', 'month':'June'},\n",
    "     {'airport':'VLC', 'month':'February'},\n",
    "     {'airport':'CDG', 'month':'December'}]\n",
    "\n",
    "y_train = ['rain', 'rain', 'sun', 'rain']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we have to deal with categorical variables, we use the pipeline object from scikit-learn. We combine a vectorizer (which basically converts categorical into numerical vectors), and a classifier (here the Linear Support Vector Classifier, `LinearSVC`). Then we call `fit` to train all the components in the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "         steps=[('dictvectorizer',\n",
       "                 DictVectorizer(dtype=<class 'numpy.float64'>, separator='=',\n",
       "                                sort=True, sparse=True)),\n",
       "                ('linearsvc',\n",
       "                 LinearSVC(C=1.0, class_weight=None, dual=True,\n",
       "                           fit_intercept=True, intercept_scaling=1,\n",
       "                           loss='squared_hinge', max_iter=1000,\n",
       "                           multi_class='ovr', penalty='l2', random_state=None,\n",
       "                           tol=0.0001, verbose=0))],\n",
       "         verbose=False)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipeline = make_pipeline( DictVectorizer(), LinearSVC() )\n",
    "pipeline.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We apply our classifier to a test set to make `prediction` on unseen data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test = [{'airport':'JFK', 'month':'June'},\n",
    "         {'airport':'CDG', 'month':'November'},\n",
    "         {'airport':'MPX', 'month':'June'},\n",
    "         {'airport':'MPX', 'month':'November'}]\n",
    "\n",
    "y_test = ['rain', 'rain', 'sun', 'rain']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['rain', 'rain', 'rain', 'rain'], dtype='<U4')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred = pipeline.predict(X_test)\n",
    "y_pred"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We finally evaluate the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.75"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_score(y_test, y_pred)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
